


1. You have this data for the attribute age: 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 
25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70. 



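a) Smoothing by bin means. Each bin lists the original values, followed by the smoothed values (every value replaced by the bin mean, rounded to the nearest integer):

Bin 1: 13, 15, 16, 16  ->  15, 15, 15, 15
Bin 2: 19, 20, 20      ->  20, 20, 20
Bin 3: 21, 22, 22      ->  22, 22, 22
Bin 4: 25, 25, 25, 25  ->  25, 25, 25, 25
Bin 5: 30, 33, 33      ->  32, 32, 32
Bin 6: 35, 35, 35, 35  ->  35, 35, 35, 35
Bin 7: 36, 40, 45      ->  40, 40, 40
Bin 8: 46, 52, 70      ->  56, 56, 56

The problem with smoothing by bin means is that the smoothed values sometimes do not reflect the real values in the bin.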
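The smoothing in part a) can be reproduced with a short Python sketch. It takes the bins exactly as they are listed in the answer (the grouping is copied from the text, not derived by a partitioning rule), and the function name `smooth_by_bin_means` is only an illustrative choice.

```python
# Bins exactly as listed in part a); the grouping is taken from the text.
bins = [
    [13, 15, 16, 16],
    [19, 20, 20],
    [21, 22, 22],
    [25, 25, 25, 25],
    [30, 33, 33],
    [35, 35, 35, 35],
    [36, 40, 45],
    [46, 52, 70],
]


def smooth_by_bin_means(bins):
    """Replace every value in a bin with the bin mean, rounded to an integer."""
    smoothed = []
    for b in bins:
        bin_mean = round(sum(b) / len(b))
        smoothed.append([bin_mean] * len(b))
    return smoothed


smoothed_bins = smooth_by_bin_means(bins)
for i, (original, smooth) in enumerate(zip(bins, smoothed_bins), start=1):
    print(f"Bin {i}: {original} -> {smooth}")
```

Bin 8 illustrates the drawback mentioned above: the single large value 70 pulls the mean to 56, which is above two of the three original values in that bin.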



b) Potter's Wheel -> Automated interactive data cleaning tool 

c) Data smoothing techniques: 

1. Binning 

2. Regression 

3. Outlier Analysis 
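d) Min-max normalization of the value 35 onto the range [0, 1]:

Max value is 70
Min value is 13

$$v_i' = \frac{v_i - \min_A}{\max_A - \min_A}\,(\mathit{new\_max}_A - \mathit{new\_min}_A) + \mathit{new\_min}_A$$

$$v_i' = \frac{35 - 13}{70 - 13}\,(1 - 0) + 0 = 0.39$$

e)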



The mean = 809/27 ≈ 29.96
The standard deviation = 12.94

z-score normalization of the value 35:

$$v_i' = \frac{v_i - \bar{A}}{\sigma_A} = \frac{35 - 29.96}{12.94} = 0.39$$
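Both normalizations of the value 35 can be checked with a minimal Python sketch. The helper names `min_max` and `z_score` are illustrative, and the sketch assumes the sample standard deviation (divisor n - 1), which is the convention that reproduces the 12.94 quoted above.

```python
from statistics import mean, stdev

ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30,
        33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]


def min_max(v, lo, hi, new_lo=0.0, new_hi=1.0):
    """Min-max normalization of v from [lo, hi] onto [new_lo, new_hi]."""
    return (v - lo) / (hi - lo) * (new_hi - new_lo) + new_lo


def z_score(v, mu, sigma):
    """z-score normalization of v given a mean and a standard deviation."""
    return (v - mu) / sigma


mm = min_max(35, min(ages), max(ages))
z = z_score(35, mean(ages), stdev(ages))  # stdev is the sample standard deviation

print(round(mm, 2), round(z, 2))  # 0.39 0.39
```

The two results agree to two decimals here only by coincidence; the methods scale the data differently.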



f) z-score normalization

It normalizes the attribute values based on the mean and standard deviation of the attribute.

g) 
Question 2



2) Suppose that a hospital tested the age and body fat data for 18 randomly selected adults 
with the following results: 



Age   %fat
23     9.5
23    26.5
27     7.8
27    17.8
39    31.4
41    25.9
47    27.4
49    27.2
50    31.2
52    34.6
54    42.5
54    28.8
56    33.4
57    30.2
58    34.1
58    32.9
60    41.2
61    35.7



a)

The mean Age = 836/18 ≈ 46.4
The median Age = (50 + 52)/2 = 51
The standard deviation Age ≈ 13

The mean Fat = 518.1/18 ≈ 28.7
The median Fat = (30.2 + 31.2)/2 = 30.7
The standard deviation Fat ≈ 9.25

b) For age

23 23 27 27 39 41 47 49 50 52 54 54 56 57 58 58 60 61



Q1 = 39, Q2 (median) = 51, Q3 = 57, Min = 23, Max = 61

[Boxplot of age: max = 61, Q3 = 57, Q2 = median = 51, Q1 = 39, min = 27]

23 is an outlier



For %fat

7.8 9.5 17.8 25.9 26.5 27.2 27.4 28.8 30.2 31.2 31.4 32.9 33.4 34.1 34.6 35.7 41.2 42.5

Q1 = 26.5, Q2 (median) = 30.7, Q3 = 34.1, Min = 7.8, Max = 42.5

[Boxplot of %fat: max = 42.5, Q3 = 34.1, Q2 = median = 30.7, Q1 = 26.5, min = 17.8]

9.5 is an outlier
7.8 is an outlier
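The five-number summaries behind both boxplots can be recomputed with the sketch below. It makes two assumptions that the answer does not state: quartiles are taken as the medians of the lower and upper halves of the sorted data, and outliers are flagged with the common 1.5 x IQR rule. The function names are illustrative.

```python
from statistics import median

age = [23, 23, 27, 27, 39, 41, 47, 49, 50, 52, 54, 54, 56, 57, 58, 58, 60, 61]
fat = [9.5, 26.5, 7.8, 17.8, 31.4, 25.9, 27.4, 27.2, 31.2, 34.6, 42.5, 28.8,
       33.4, 30.2, 34.1, 32.9, 41.2, 35.7]


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), with Q1/Q3 as medians of the halves."""
    s = sorted(values)
    half = len(s) // 2
    lower, upper = s[:half], s[-half:]  # for odd n the middle value is in neither half
    return s[0], median(lower), median(s), median(upper), s[-1]


def iqr_outliers(values, k=1.5):
    """Values outside [Q1 - k*IQR, Q3 + k*IQR] (assumed outlier rule)."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low_fence, high_fence = q1 - k * iqr, q3 + k * iqr
    return [v for v in sorted(values) if v < low_fence or v > high_fence]


print(five_number_summary(age), iqr_outliers(age))
print(five_number_summary(fat), iqr_outliers(fat))
```

Under these assumptions the %fat outliers come out as 7.8 and 9.5, matching the plot. For age the lower fence is 39 - 1.5 x 18 = 12, so 23 lies inside the fence; marking 23 as an outlier in the age plot would therefore rely on a different criterion than the 1.5 x IQR rule.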



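c) z-score normalization for 50 (age) and 31.2 (%fat)
Max value of age is 61, min value is 23
For 50:

$$v_i' = \frac{v_i - \bar{A}}{\sigma_A} = \frac{50 - 46.4}{13} = 0.28$$

For 31.2:

$$v_i' = \frac{v_i - \bar{A}}{\sigma_A} = \frac{31.2 - 28.7}{9.25} = 0.27$$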
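As a check on parts a) and c), the sketch below recomputes the descriptive statistics with Python's standard library and then applies the z-score formula with the rounded means and standard deviations used in the text. It assumes the sample standard deviation (`statistics.stdev`, divisor n - 1); the helper name `z_score` is illustrative.

```python
from statistics import mean, median, stdev

age = [23, 23, 27, 27, 39, 41, 47, 49, 50, 52, 54, 54, 56, 57, 58, 58, 60, 61]
fat = [9.5, 26.5, 7.8, 17.8, 31.4, 25.9, 27.4, 27.2, 31.2, 34.6, 42.5, 28.8,
       33.4, 30.2, 34.1, 32.9, 41.2, 35.7]


def z_score(v, mu, sigma):
    """z-score normalization of v given a mean and a standard deviation."""
    return (v - mu) / sigma


# Unrounded statistics, for comparison with the rounded figures in part a).
for name, data in (("age", age), ("%fat", fat)):
    print(name, round(mean(data), 2), round(median(data), 2), round(stdev(data), 2))

# z-scores using the rounded values quoted in the text.
z_age = z_score(50, 46.4, 13)
z_fat = z_score(31.2, 28.7, 9.25)
print(round(z_age, 2), round(z_fat, 2))  # 0.28 0.27
```

Because the text rounds the means and standard deviations before dividing, plugging in the unrounded statistics gives slightly smaller z-scores.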



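d) The mean Age = 836/18 ≈ 46.4, the mean Fat = 518.1/18 ≈ 28.7

Correlation coefficient:

$$\sum_{i=1}^{18} a_i b_i = 23 \times 9.5 + 23 \times 26.5 + 27 \times 7.8 + 27 \times 17.8 + 39 \times 31.4 + 41 \times 25.9 + 47 \times 27.4 + 49 \times 27.2 + 50 \times 31.2 + 52 \times 34.6 + 54 \times 42.5 + 54 \times 28.8 + 56 \times 33.4 + 57 \times 30.2 + 58 \times 34.1 + 58 \times 32.9 + 60 \times 41.2 + 61 \times 35.7 = 25763.2$$

$$r_{A,B} = \frac{\sum_{i=1}^{n} a_i b_i - n\,\bar{A}\,\bar{B}}{(n - 1)\,\sigma_A\,\sigma_B} = \frac{25763.2 - 836 \times 518.1 / 18}{17 \times 13.22 \times 9.25} \approx \frac{1700.3}{2078.8} \approx 0.82$$

Here the unrounded means (836/18 and 518.1/18) and the sample standard deviations (13.22 and 9.25) are used, because the result is sensitive to rounding of the means.

The correlation coefficient is positive, so age and %fat are positively correlated.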
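The correlation between age and %fat can be verified numerically. The following is a minimal sketch of Pearson's product-moment coefficient written from its definition; the function name `pearson_r` is illustrative, and working with deviations from the unrounded means avoids the rounding sensitivity of the hand calculation.

```python
from math import sqrt

age = [23, 23, 27, 27, 39, 41, 47, 49, 50, 52, 54, 54, 56, 57, 58, 58, 60, 61]
fat = [9.5, 26.5, 7.8, 17.8, 31.4, 25.9, 27.4, 27.2, 31.2, 34.6, 42.5, 28.8,
       33.4, 30.2, 34.1, 32.9, 41.2, 35.7]


def pearson_r(xs, ys):
    """Pearson correlation coefficient computed from deviations about the means."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


r = pearson_r(age, fat)
print(round(r, 2))  # 0.82
```

The divisor (n or n - 1) cancels in this form, so the result does not depend on whether population or sample standard deviations are used.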



Question 3 



3- Given two objects represented by the tuples (22, 1, 42, 10) and (20, 0, 36, 8): 

a) Euclidean distance

$$\sqrt{(22 - 20)^2 + (1 - 0)^2 + (42 - 36)^2 + (10 - 8)^2} = \sqrt{2^2 + 1^2 + 6^2 + 2^2} = \sqrt{45} \approx 6.7$$

b) Manhattan distance

$$|22 - 20| + |1 - 0| + |42 - 36| + |10 - 8| = 11$$

c) Minkowski distance (order 3)

$$\sqrt[3]{|22 - 20|^3 + |1 - 0|^3 + |42 - 36|^3 + |10 - 8|^3} = \sqrt[3]{233} \approx 6.15$$
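All three distances are instances of the Minkowski distance with different orders, which the short sketch below makes explicit. The function name `minkowski` and the parameter name `h` are illustrative choices, not taken from the original answer.

```python
def minkowski(x, y, h):
    """Minkowski distance of order h between two equal-length tuples."""
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1 / h)


x = (22, 1, 42, 10)
y = (20, 0, 36, 8)

manhattan = minkowski(x, y, 1)  # h = 1
euclidean = minkowski(x, y, 2)  # h = 2
mink3 = minkowski(x, y, 3)      # h = 3

print(round(euclidean, 2), round(manhattan, 2), round(mink3, 2))  # 6.71 11.0 6.15
```

As the order grows, the distance shrinks toward the largest single coordinate difference, which is 6 for these two tuples.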



